Today we will be focusing on the practice of geospatial data visualization.
As a reminder, our framework for the workflow of data visualization is shown in Figure 3.1.
Figure 2.1: Tidyverse
2.1 Load and Install Packages
As a warmup, let’s load the tidyverse. Remember, you can check that a package is loaded in your R session by going to the Files, Plots, and Packages pane, clicking on the Packages tab, and scrolling down to tidyverse to make sure it is checked.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
2.1.1 Geospatial Packages
R has multiple packages that enable geospatial data visualization. Today, we will create basic maps using the sf and leaflet packages. sf stands for Simple Features, an open-source geographic information systems data format that works well with the tidyverse. Leaflet is an open-source library for mobile-friendly interactive maps.
As a reminder, when we first use a package, we need to install it locally. Let’s start by installing both packages. This only needs to be done once.
install.packages('sf')
install.packages('leaflet')
Next, we load the packages to make sure they are available within the R environment.
library(sf)
library(leaflet)
Installing and loading packages is EZ!
2.2 Acquiring Data
We are going to acquire two datasets for testing today. The first dataset is the nc shapefile dataset that should be installed as part of the sf package. Unlike mpg, this dataset is not directly available in the global environment.
We will also import a geospatial environmental dataset for our test examples. We are going to use a curated version of the CalEnviroScreen 4.0 data. The data for the full state of California can be acquired as a zipped ESRI shapefile (.shp). I’ve hosted a smaller version just for SoCal on my GitHub repository for this analysis. Downloading, unzipping, and then importing files is a big part of data science. Unfortunately, it has been my experience that having 15 people all have the same directory and file structures on 15 different machines is fraught with peril. I don’t want to troubleshoot that for an hour in class, so I tried to make the EZ mode code below.
2.2.1 North Carolina shapefile
We will read the data in using the sf function st_read and the base R function system.file.
st_read() is a specialized function that reads and loads in geospatial data files of various formats.
system.file() is a function that checks for files within the directories of installed packages. It searches within the package directories on your computer.
st_transform() is required to display the polygons properly in leaflet in the WGS84 coordinate reference system.
We assign the North Carolina data we’re reading in to a dataset using the <- operator. The <- operator tells R that the data we are loading should be placed into a dataset that we have named nc. In future code, typing nc will access that underlying data.
Lastly, there is the pipe operator |>. These pipes are amazing coding features that transformed coding for me. I cannot express my love for the |> in words alone. It allows existing data or information to be passed to the next line of code after some operation has been done on the data. The pipe operator says, ‘take the output from this line of code, and now do this next thing’. I like to think of them as Super Mario pipes that Mario enters and is transported to another place. For now, take it on faith that this is the most useful operator in the tidyverse and will be littered throughout this course.
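Putting those pieces together, a sketch of the nc import is shown below. The shape/nc.shp path is where the example shapefile ships inside the sf package, and EPSG code 4326 corresponds to the WGS84 coordinate reference system.

# Read the North Carolina shapefile bundled with sf, then transform to WGS84 for leaflet
nc <- st_read(system.file("shape/nc.shp", package = "sf")) |>
  st_transform(crs = 4326)   # EPSG 4326 = WGS84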
The second dataset is hosted on the GitHub repository where the class materials are version controlled. The code below names the internet URL where the data is stored as URL.path. If you go to the URL link, you will see the raw .geoJSON formatted spatial data file.
Next, we apply the same st_read() function to read the geoJSON file. geoJSON is a common format for encoding geographic data structures, and st_read() natively knows how to read it in. The st_transform() function again changes the polygons into the WGS84 coordinate reference system.
This time, we’ve named the dataset SoCalEJ.
Finally, we use the head() function on our SoCalEJ dataset to look at the first few rows.
Sometimes the data import messes up at the end of the dataset. A thorough scholar will use the tail() function to check the last few rows as well.
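A sketch of the SoCalEJ import and the quick checks is below; URL.path is a placeholder that should point at the raw .geoJSON file on the class GitHub repository.

# URL.path is a placeholder - substitute the raw GitHub link to the class .geoJSON file
URL.path <- 'https://raw.githubusercontent.com/<user>/<repo>/main/SoCalEJ.geojson'

# st_read() understands geoJSON natively; st_transform() converts to WGS84
SoCalEJ <- st_read(URL.path) |>
  st_transform(crs = 4326)

# Check the top and bottom of the imported dataset
head(SoCalEJ)
tail(SoCalEJ)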
In future classes, we may dive into the perilous world of file structures and directories. However, that is unfun, unrewarding, and gruesomely detail oriented, so we’ll avoid it for now.
2.2.3 Creating a New Geospatial Dataset
In addition to importing data, sometimes we just want to create our own dataset. Let’s demonstrate how to create a very simple data frame in R for the latitude and longitude of a couple of different locations we want to put on a map.
I want to show the location of this classroom at the Redford Conservancy and the neighborhood where I live.
I can use Google Earth or similar websites to get the latitude and longitude data. Latitude is North and South. Longitude is East and West.
My zip code 92508 has a centroid at 33.8895145 N and 117.319014 W; in decimal coordinates, west longitudes are negative, so we record it as -117.319014.
The code below will create a data frame which is an R data structure that has rows and columns. A data frame is a generic data object that can store tabular data of many different types.
We’ll use two base R functions to create the data frame, c() and data.frame().
c() - this function concatenates data values into a vector.
data.frame() - this function turns vectors (or lists) into a data.frame type.
We assign latitude values to a variable named lat, and longitude values to a variable named lng. We use c() because we have multiple values that we want to include. We combine them into the locations data.frame. Then we check it worked by typing locations.
lat <- c(34.1100576, 33.8895145)
lng <- c(-117.710074, -117.319014)
locations <- data.frame(lat, lng)
locations
lat lng
1 34.11006 -117.7101
2 33.88951 -117.3190
2.2.3.1 Exercise 1
Find another location’s latitude and longitude coordinates that you want to add to a map.
Hack the code to add your location of interest to the two existing locations data.frame.
Display the output of the locations data.frame to show that your code worked!
2.3 Visualize the Data - Geospatial Edition
2.3.1 Choose the visualization type
We’ll be exploring leaflet() maps for our visualizations. Leaflet has a number of built-in display functions for geospatial data that make things look really cool. I will note that ggplot can make maps too with geom_sf() and other functionality, but I prefer the leaflet tiles as a superior tool for interactive maps.
2.3.2 Choose the visualization function
leaflet()
addTiles()
addMarkers()
addPolygons()
addLegend()
We’ll apply leaflet() to every visualization for geospatial data today, and add at least one add function to display spatial information.
2.3.3 Write the code for a basic geospatial visualization
Leaflet is a super cool package for making interactive maps with very few lines of code. As before let’s start with basics and then iterate.
Note that Leaflet has a different style and coding aesthetic than ggplot, so some of the syntax and grammar is a bit different. However, the steps are the same as for ggplot.
Add a verb or two for an action - usually this is the function.
Add an object to apply the action to - this is usually the dataframe, but can be a list or another type of data object.
Add adjectives and adverbs to modify the action or the object.
2.3.3.1 Example 1 - Basic Tile Map
Figure 2.2 shows the standard leaflet map with a minimal reproducible example. This map shows the whole world in a Mercator projection. The map is interactive, just like most of the maps you come across in standard apps and websites on the modern internet. You can zoom in and see the details of an area like Los Angeles, with all the annotation and standard roads, forests, city names, etc.
The code is simply two lines with two actionable functions.
leaflet() - this function identifies the type of visualization.
addTiles() - this function tells the leaflet map to display the default visualization ‘tile’. Tiles are the way geospatial imagery is rendered at each zoom level, so the map avoids displaying too much detail as you zoom out.
leaflet() |>
  addTiles()
Figure 2.2: A very basic Leaflet map
A base map is nice by itself, but we have not added any information to it. A map alone is not going to be very useful as a visualization unless we add some data.
2.3.3.2 Example 2. Tile Map with Location Markers
Remember that locations data frame we created earlier. Let’s add that to the map.
addMarkers() is a useful function to add point data to a map.
Figure 2.3 shows the map with my two locations on it added as markers. We added another line of code with a pipe operator to put the locations data on the map.
Whoot! A map with locations! And now the spatial extent of the map defaults to just show an area that encompasses the markers in the locations dataset. If your location that you added to the dataset happens to be outside California, your map will be zoomed way out compared to someone who only includes locations in SoCal.
I purposely made the column names lat and lng because then we don’t have to assign those variables in the addMarkers call. If your columns are named something else (e.g., Lati or long), leaflet will need to be told that those are the correct columns to look for.
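A sketch of the code behind Figure 2.3 is below, assuming the locations data frame created earlier.

# Marker map sketch: leaflet picks up the lat and lng columns automatically
leaflet(data = locations) |>
  addTiles() |>
  addMarkers()

# If the columns were named differently (e.g., Lati and long), tell addMarkers explicitly:
# addMarkers(data = locations, lat = ~Lati, lng = ~long)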
2.3.3.3 Example 3. North Carolina Counties on a Tile Map
Going beyond point locations, geospatial data really shines when we display shapes. We’ll start by displaying the nc dataframe.
addPolygons() is the function used to display polygons.
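A sketch of the code behind Figure 2.4, assuming the nc dataset imported above:

# Draw the North Carolina county polygons on top of the default tiles
leaflet() |>
  addTiles() |>
  addPolygons(data = nc)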
Figure 2.4: Leaflet map with North Carolina Counties
We have done it!
2.3.3.4 Exercise 2
Make a leaflet map with the SoCalEJ dataset. It should look like Figure 2.5.
Figure 2.5: Leaflet map with Calenviroscreen 4.0
2.3.4 Improve the Visualization
The basic visualization is in need of some improvement. First, let’s explore color, fillColor, stroke, and opacity.
Within the addPolygons() function, there are many options that can be modified to alter the output from the default settings. We won’t cover all these options today.
color
weight
opacity
fillColor
fillOpacity
stroke
group
label
Let’s iterate and make some improvements!
Figure 2.6 shows what happens when we just change the color from blue to black. Both the fill and line color are changed.
leaflet() |>
  addTiles() |>
  addPolygons(data = nc, color = 'black')
Figure 2.6: Leaflet map with North Carolina Counties and color changed to black
The lines are too heavy. Let’s lower the weight - we’ll let the color go back to blue to keep the example minimal.
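A sketch of that call, keeping the default color and thinning the outline:

# Thinner county outlines with weight = 1
leaflet() |>
  addTiles() |>
  addPolygons(data = nc, weight = 1)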
Figure 2.7 shows the result, which is a much cleaner look.
Figure 2.7: Leaflet map for North Carolina Counties with line weight changed to 1.
Much better!
Within the North Carolina dataset are some numerical values by county. Let’s use fillColor to shade the counties according to those values.
Unfortunately, leaflet requires us to do a bit of coding on our color palette.
Functions that help define the color palette in leaflet are
colorNumeric() - a linear color scale
colorQuantile() - a color scale that makes sure the bin sizes are approximately equal
colorFactor() - a color scale that is good for categorical (‘non-numeric’) data
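The palette definitions are sketched below; the names pal1 and pal2 are assumptions based on how they are used in the figures that follow.

# Numeric (linear) palette for the BIR79 column using the YlGn colors
pal1 <- colorNumeric(palette = 'YlGn', domain = nc$BIR79)

# Quantile palette for BIR79 - each color bin holds roughly the same number of counties
pal2 <- colorQuantile(palette = 'YlGn', domain = nc$BIR79)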
This creates color palettes for the BIR79 category of the North Carolina data in both numeric and quantile formats. I have chosen the YlGn palette from Figure 1.6.
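A sketch of the map shown in Figure 2.8, filling each county with pal1 and keeping the thin outlines:

# Counties filled by the numeric BIR79 palette
leaflet() |>
  addTiles() |>
  addPolygons(data = nc, weight = 1, fillColor = ~pal1(BIR79))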
Figure 2.8: Leaflet map for North Carolina Counties with numeric palette for BIR79 and line weight = 1.
Increasing the fillOpacity to 0.8 will help us to see the data much better. Also, let’s remove the lines completely by setting stroke to FALSE. Figure 2.9 shows the result.
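A sketch of that version of the map:

# Remove the outlines (stroke = FALSE) and make the fill more opaque
leaflet() |>
  addTiles() |>
  addPolygons(data = nc, stroke = FALSE, fillColor = ~pal1(BIR79), fillOpacity = 0.8)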
Figure 2.9: Leaflet map for North Carolina Counties with numeric palette for BIR79, increasing opacity, removing lines.
That has a much better pop and we can really see the county differences in the high and low population areas.
Figure 2.10 shows the quantile binned palette pal2. Now the number of counties in each color bin is about the same, which helps to sort the differences among the lowest population counties.
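The only change from the previous sketch is swapping pal1 for pal2:

# Same map, but with the quantile palette
leaflet() |>
  addTiles() |>
  addPolygons(data = nc, stroke = FALSE, fillColor = ~pal2(BIR79), fillOpacity = 0.8)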
Figure 2.10: Leaflet map for North Carolina Counties with quantile palette for BIR79, opacity optimized, and lines removed.
2.3.5 Exercise 3
Select a different color palette from Figure 1.6 to replace ‘YlGn’ in pal1.
Generate a North Carolina map with your choice of color palette for the BIR79 category.
2.3.5.1 Adding a Legend
The existing map is not bad, but the color scale is not self-explanatory. We need to add a legend so a user can understand the data scale.
addLegend() provides that functionality.
Figure 2.11 shows the legend overlay with the map. Note that we need to either move the data = nc into the initial call to leaflet() or define it separately in both the addPolygons and addLegend functions.
The addLegend() function requires inputs for pal and values. We need to supply the color palette pal1 and the values for the scale, ~BIR79. The title is optional, but the scale doesn’t make much sense without a description.
leaflet(data = nc) |>
  addTiles() |>
  addPolygons(stroke = FALSE, fillColor = ~pal1(BIR79), fillOpacity = 0.8) |>
  addLegend(pal = pal1, title = 'Births in 1979', values = ~BIR79)
Figure 2.11: Leaflet map for North Carolina Counties with numeric palette for BIR79, color legend, opacity optimized, and lines removed.
2.3.5.2 Exercise 4
Create a color palette for a SoCalEJ dataset variable. Choose one of the following numerical variables and a color palette from Figure 1.6:
Poverty
Hispanic
AfricanAm
Ozone
DieselPM_P
Create a map for the SoCalEJ dataset with census tracts color coded by your chosen variable.